Extracting Predictor Variables to Construct Breast Cancer Survivability Model with Class Imbalance Problem

Authors

  • M. Ahmadzadeh, Faculty of Computer and IT Engineering, Shiraz University of Technology, Shiraz, Iran.
  • S. Miri Rostami, Faculty of Computer and IT Engineering, Shiraz University of Technology, Shiraz, Iran.
Abstract:

Applying data mining methods as a decision support system is of great benefit in predicting the survival of new patients. It also offers health researchers considerable potential for investigating the relationship between risk factors and cancer survival. However, because datasets associated with breast cancer survival are imbalanced, building accurate survival prognosis models is challenging. This study aims to develop a predictive model for the 5-year survivability of breast cancer patients and to discover relationships between certain predictive variables and survival. The dataset was obtained from the SEER database. First, the effectiveness of two synthetic oversampling methods, Borderline SMOTE and the Density-based Synthetic Oversampling method (DSO), in solving the class imbalance problem is investigated. Then a combination of particle swarm optimization (PSO) and correlation-based feature selection (CFS) is used to identify the most important predictive variables. Finally, to build a predictive model, three classifiers, a decision tree (C4.5), a Bayesian network, and logistic regression, are applied to the cleaned dataset. Accuracy, sensitivity, specificity, and G-mean are used to evaluate the performance of the proposed hybrid approach, and the area under the ROC curve (AUC) is used to evaluate the performance of the feature selection method. Results show that, among all combinations, DSO + PSO_CFS + C4.5 performs best in terms of accuracy, sensitivity, G-mean, and AUC, with values of 94.33%, 0.930, 0.939, and 0.939, respectively.
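The assessment metrics named in the abstract can be illustrated with a minimal sketch. It assumes the usual definitions (sensitivity as the true-positive rate, specificity as the true-negative rate, and G-mean as the geometric mean of the two); the function name and the confusion-matrix counts below are hypothetical and are not taken from the study.

```python
import math


def imbalance_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity, specificity and G-mean from confusion-matrix counts.

    Assumed definitions (not quoted from the paper):
      sensitivity = TP / (TP + FN)
      specificity = TN / (TN + FP)
      G-mean      = sqrt(sensitivity * specificity)
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": g_mean,
    }


# Hypothetical counts for an imbalanced test set (100 minority, 900 majority cases).
metrics = imbalance_metrics(tp=90, fn=10, tn=800, fp=100)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because G-mean multiplies the two per-class rates, a classifier that ignores the minority class scores close to zero on it even when its overall accuracy is high, which is why it is commonly reported alongside accuracy on imbalanced data.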


similar resources

Breast Cancer Diagnosis from Perspective of Class Imbalance

Introduction: Breast cancer is the second leading cause of mortality among women. Early detection is the only way to reduce the risk of breast cancer mortality. Traditional methods cannot effectively diagnose tumors since they are based on the assumption of a well-balanced dataset. However, a hybrid method can help to alleviate the two-class imbalance problem existing in the ...


Handling class imbalance problem in miRNA dataset associated with cancer

MiRNAs are small (~22 nt long) non-coding RNA sequences that bind to complementary target sites in the 3' untranslated region (UTR) of mRNA sequences, although binding is not restricted to this region and also occurs in other mRNA regions, viz. the 5' UTR and coding sequences (CDS). Complementary binding of miRNA to mRNA target sites either results in complete degradation of the mRNA itself or it may regulate the mRNA as an oncogene or as a tumo...


Classification with class imbalance problem: A Review

Most existing classification approaches assume the underlying training set is evenly distributed. In class-imbalanced classification, the training set for one class (the majority) far surpasses the training set of the other class (the minority), and the minority class is often the more interesting class. In this paper, we review the issues that come with learning from imbalanced class data sets a...


Class Imbalance Problem in Data Mining Review

In the last few years, the classification of data has undergone major changes and evolution. As the application areas of technology increase, the size of data also increases. Classification of data becomes difficult because of the unbounded size and imbalanced nature of data. The class imbalance problem has become the greatest issue in data mining. The imbalance problem occurs where one of the two classes havi...


The Class Imbalance Problem: Significance and Strategies

Although the majority of concept-learning systems previously designed usually assume that their training sets are well-balanced, this assumption is not necessarily correct. Indeed, there exist many domains for which one class is represented by a large number of examples while the other is represented by only a few. The purpose of this paper is 1) to demonstrate experimentally that, at least in ...


Handling Class Imbalance Problem Using Feature Selection

The class imbalance problem is a challenge to machine learning and data mining, and it has attracted significant research in recent years. A classifier affected by the class imbalance problem for a specific data set would show strong overall accuracy but very poor performance on the minority class. Imbalanced data sets are pervasive in real-world applications. Examples of these ki...




Journal title

volume 6, issue 2

pages 263-276

publication date 2018-07-01
